Video-language pre-training has advanced the performance of various downstream video-language tasks. However, most previous methods directly inherit or adapt typical image-language pre-training paradigms to video-language pre-training, and thus do not fully exploit the unique temporal characteristic of video. In this paper, we propose a Hierarchical Temporal-Aware video-language pre-training framework, HiTeA, with two novel pre-training tasks for modeling cross-modal alignment between moments and texts as well as the temporal relations of video-text pairs. Specifically, we propose a cross-modal moment exploration task to explore moments in videos, which results in detailed video moment representations. In addition, the inherent temporal relations are captured by aligning video-text pairs as a whole at different time resolutions with a multi-modal temporal relation exploration task. Furthermore, we introduce the shuffling test to evaluate the temporal reliance of datasets and video-language pre-training models. We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks, especially on temporal-oriented datasets (e.g., SSv2-Template and SSv2-Label), with improvements of 8.6% and 11.1%, respectively. HiTeA also demonstrates strong generalization ability when directly transferred to downstream tasks in a zero-shot manner. Models and demo will be available on ModelScope.
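The shuffling test admits a minimal sketch. Everything below is illustrative: the `score_fn` interface and the toy order-sensitive score are assumptions, not details from the paper; the idea is simply that a small gap between the ordered and shuffled scores indicates weak temporal reliance.

```python
import random

def shuffling_test(frames, text, score_fn, seed=0):
    """Compare a video-text matching score on ordered vs. shuffled frames.

    A small gap between the two scores suggests the model (or dataset)
    relies little on temporal order.
    """
    ordered = score_fn(frames, text)
    shuffled_frames = frames[:]
    random.Random(seed).shuffle(shuffled_frames)
    shuffled = score_fn(shuffled_frames, text)
    return ordered - shuffled  # temporal-reliance gap

# Toy score that ignores the text and rewards frames in ascending order,
# standing in for a real video-text matching model.
def toy_score(frames, text):
    return sum(a < b for a, b in zip(frames, frames[1:])) / (len(frames) - 1)

gap = shuffling_test(list(range(8)), "a person opens a door", toy_score)
```

On a perfectly order-sensitive score such as the toy one, shuffling can only lower the score, so the gap is non-negative.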
Clustering has been extensively studied in centralized settings, but remains relatively unexplored in federated settings, where data are distributed among multiple clients and can only be kept local at the clients. The necessity to invest more resources in improving federated clustering methods is twofold: 1) the performance of supervised federated learning models can benefit from clustering; 2) it is non-trivial to extend centralized methods to perform federated clustering tasks. In centralized settings, various deep clustering methods that perform dimensionality reduction and clustering jointly have achieved great success. To obtain high-quality cluster information, it is natural but non-trivial to extend these methods to federated settings. For this purpose, we propose a simple but effective federated deep clustering method. It requires only one communication round between the central server and the clients, can run asynchronously, and can handle device failures. Moreover, although most studies have highlighted the adverse effects of non-independent and identically distributed (non-IID) data across clients, our experimental results indicate that the proposed method can significantly benefit from this scenario.
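A one-communication-round protocol of this shape can be sketched generically. The centers-of-centers construction below, with plain k-means on each side, is an assumption for illustration and not the paper's specific method: each client clusters locally and uploads only its centers once; the server then clusters the pooled centers.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal Lloyd's k-means; returns the k cluster centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(assign == j):  # skip empty clusters
                centers[j] = X[assign == j].mean(axis=0)
    return centers

def one_round_federated_clustering(client_data, k):
    """Each client clusters locally and uploads only its centers once;
    the server clusters the pooled centers into k global ones."""
    local = [kmeans(X, k) for X in client_data]  # runs on the clients
    pooled = np.concatenate(local)               # the single upload
    return kmeans(pooled, k)                     # runs on the server

rng = np.random.default_rng(1)
clients = [rng.standard_normal((30, 2)) + c for c in (0.0, 5.0)]
centers = one_round_federated_clustering(clients, k=2)
```

Since raw data never leaves a client and each upload is independent, a scheme of this form naturally tolerates asynchrony and dropped devices.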
One-shot network pruning at initialization (OPaI) is an effective way to reduce the cost of network pruning. Recently, there has been a growing belief that data is unnecessary in OPaI. However, we reach the opposite conclusion through ablation experiments on two representative OPaI methods, SNIP and GraSP. Specifically, we find that informative data is crucial for enhancing pruning performance. In this paper, we propose two novel methods, Discriminative One-shot Network Pruning (DOP) and Super Stitching, to prune networks with highly discriminative image patches. Our contributions are as follows. (1) Extensive experiments reveal that OPaI is data-dependent. (2) Super Stitching significantly outperforms the original OPaI methods on the ImageNet benchmark, especially for highly compressed models.
Pixel synthesis is a promising research paradigm for image generation, which can well exploit pixel-wise prior knowledge for generation. However, existing methods still suffer from excessive memory footprint and computational overhead. In this paper, we propose a progressive pixel synthesis network for efficient image generation, called PixelFolder. Specifically, PixelFolder formulates image generation as a progressive pixel regression problem and synthesizes images via a multi-stage structure, which can greatly reduce the overhead caused by large tensor transformations. In addition, we introduce novel pixel folding operations to further improve model efficiency while maintaining pixel-wise prior knowledge for end-to-end regression. With these innovative designs, we greatly reduce the expenditure of pixel synthesis, e.g., reducing computation by 89% and parameters by 53% compared with the latest pixel synthesis method CIPS. To validate our method, we conduct extensive experiments on two benchmark datasets, namely FFHQ and LSUN Church. Experimental results show that, with much less expenditure, PixelFolder obtains new state-of-the-art (SOTA) performance on the two benchmark datasets, i.e., 3.77 FID on FFHQ and 2.45 FID on LSUN Church. Meanwhile, PixelFolder is also more efficient than SOTA methods such as StyleGAN2, reducing computation by about 72% and parameters by about 31%, respectively. These results strongly validate the effectiveness of the proposed PixelFolder.
Face sketch synthesis has been widely used in multimedia entertainment and law enforcement. Despite recent advances in deep neural networks, accurate and realistic face sketch synthesis remains a challenging task due to the diversity and complexity of human faces. Current image-to-image translation based face sketch synthesis frequently suffers from over-fitting problems on small-scale datasets. To tackle this issue, we present an end-to-end Memory-Oriented Style Transfer network (MOST-Net) for face sketch synthesis, which can produce high-fidelity sketches with limited data. Specifically, an externally self-supervised dynamic memory module is introduced to capture domain-alignment knowledge. In this way, our proposed model can obtain domain-transfer ability by establishing durable relationships between faces and their corresponding sketches at the feature level. Furthermore, we design a novel Memory Refinement Loss (MR Loss) for feature alignment in the memory module, which enhances the accuracy of the memory slots. Extensive experiments on the CUFS and CUFSF datasets show that our MOST-Net achieves state-of-the-art performance, especially in terms of the Structural Similarity Index (SSIM).
Predicting human motion from a historical pose sequence is key for machines to interact with humans in an intelligent manner. One aspect that has been obviated so far is the fact that how we represent the skeletal pose has a critical impact on the prediction results, yet no effort has been made to investigate different pose representation schemes. We conduct an in-depth study on various pose representations, focusing on their effects on the motion prediction task. Moreover, recent approaches build upon off-the-shelf RNN units for motion prediction. These approaches process input pose sequences sequentially and inherently have difficulties in capturing long-term dependencies. In this paper, we propose a novel RNN architecture for motion prediction, AHMR (Attentive Hierarchical Motion Recurrent network), which simultaneously models local motion contexts and a global context. We further explore a geodesic loss and a forward kinematics loss for the motion prediction task, which have more geometric significance than the widely adopted L2 loss. Interestingly, we apply our method to a range of articulated objects, including human, fish, and mouse. Empirical results show that our approach outperforms the state-of-the-art methods in short-term prediction and achieves much enhanced long-term prediction proficiency, such as retaining natural, human-like motion at 50-second predictions. Our code has been released.
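The forward kinematics loss can be illustrated in a 2-D toy setting. The planar serial chain and angle/bone-length parameterization below are assumptions for illustration only; the actual work deals with 3-D skeletons, and the geodesic rotation loss is not shown.

```python
import numpy as np

def forward_kinematics(angles, bone_lengths):
    """2-D forward kinematics for a serial chain: accumulate joint angles
    along the chain and add each bone's offset to get joint positions."""
    positions = [np.zeros(2)]
    total = 0.0
    for theta, length in zip(angles, bone_lengths):
        total += theta
        offset = length * np.array([np.cos(total), np.sin(total)])
        positions.append(positions[-1] + offset)
    return np.stack(positions)

def fk_loss(pred_angles, true_angles, bone_lengths):
    """Mean joint-position error after forward kinematics, so rotation
    errors are weighted by their actual effect in Euclidean space."""
    p = forward_kinematics(pred_angles, bone_lengths)
    t = forward_kinematics(true_angles, bone_lengths)
    return float(np.linalg.norm(p - t, axis=1).mean())

loss = fk_loss([0.0, np.pi / 2], [0.0, np.pi / 2], [1.0, 1.0])  # → 0.0
```

Unlike a plain L2 loss on angles, an error at a proximal joint propagates through the chain and is penalized more heavily, which is the geometric appeal of such a loss.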
Transformer-based models are inefficient in processing long sequences due to the quadratic space and time complexity of the self-attention modules. To address this limitation, Linformer and Informer reduce the quadratic complexity to linear (modulo logarithmic factors) via low-dimensional projection and row selection, respectively. These two models are intrinsically connected, and to understand their connection we introduce a theoretical framework of matrix sketching. Based on the theoretical analysis, we propose Skeinformer to accelerate self-attention and further improve its accuracy with three carefully designed components: column sampling, adaptive row normalization, and pilot sampling reutilization. Experiments on the Long Range Arena (LRA) benchmark demonstrate that our method outperforms alternatives with a consistently smaller time/space footprint.
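The column-sampling idea can be sketched with generic importance sampling. The norm-proportional probabilities and the 1/(m·p) reweighting below are a standard sketching construction, not Skeinformer's exact estimator, and the adaptive row normalization and pilot-sample reutilization components are omitted.

```python
import numpy as np

def sampled_attention(Q, K, V, m, seed=0):
    """Approximate softmax(Q K^T / sqrt(d)) V by sampling m key columns.

    Keys are drawn with probability proportional to their squared norm,
    and the exponentiated scores are reweighted by 1 / (m * p_j), a
    generic importance-sampling correction, before row normalization.
    """
    n, d = K.shape
    rng = np.random.default_rng(seed)
    p = np.linalg.norm(K, axis=1) ** 2
    p = p / p.sum()
    idx = rng.choice(n, size=m, replace=True, p=p)
    S = Q @ K[idx].T / np.sqrt(d)                 # (n_q, m) sampled scores
    W = np.exp(S - S.max(axis=1, keepdims=True)) / (m * p[idx])
    W = W / W.sum(axis=1, keepdims=True)          # row-normalize
    return W @ V[idx]

rng = np.random.default_rng(1)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((16, 8))
V = rng.standard_normal((16, 8))
out = sampled_attention(Q, K, V, m=8)
```

The cost scales with the number of sampled columns m instead of the full sequence length n, which is where the sub-quadratic behavior comes from.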
The areas under the ROC curve (AUROC) and the precision-recall curve (AUPRC) are common metrics for evaluating classification performance on imbalanced problems. Compared with AUROC, AUPRC is a more appropriate metric for highly imbalanced datasets. While stochastic optimization of AUROC has been studied extensively, principled stochastic optimization of AUPRC has rarely been explored. In this work, we propose a principled technical method to optimize AUPRC for deep learning. Our approach is based on maximizing the averaged precision (AP), which is an unbiased point estimator of AUPRC. We cast the objective into a sum of dependent compositional functions, with inner functions that depend on the random variables of the outer level. By leveraging recent advances in stochastic compositional optimization, we propose adaptive and non-adaptive stochastic algorithms named SOAP with provable convergence guarantees. Extensive experimental results on image and graph datasets demonstrate that our proposed method outperforms prior methods on imbalanced problems in terms of AUPRC. To the best of our knowledge, our work represents the first attempt to optimize AUPRC with provable convergence. SOAP has been implemented in the libAUC library at https://libauc.org/.
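The quantity being maximized, average precision, has a simple rank-based form. The function below is the standard AP estimator, not SOAP's smoothed stochastic surrogate of it.

```python
import numpy as np

def average_precision(scores, labels):
    """Average precision: the mean, over positives, of the precision at
    each positive's rank, i.e., a point estimate of the area under the
    precision-recall curve."""
    order = np.argsort(-np.asarray(scores))          # rank by score, descending
    y = np.asarray(labels)[order]
    cum_pos = np.cumsum(y)                           # positives seen so far
    ranks = np.arange(1, len(y) + 1)
    precision_at_pos = cum_pos[y == 1] / ranks[y == 1]
    return precision_at_pos.mean()

# Positives ranked 1st and 3rd: precisions 1/1 and 2/3, so AP = 5/6.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```

The difficulty SOAP addresses is that each precision term depends on the scores of all examples, making AP a composition of dependent random functions rather than a plain sum over samples.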
State-of-the-art attacks on NLP models lack a shared definition of a successful attack. We distill ideas from past work into a unified framework: a successful natural language adversarial example is a perturbation that fools the model and follows some linguistic constraints. We then analyze the outputs of two state-of-the-art synonym substitution attacks. We find that their perturbations often do not preserve semantics, and 38% introduce grammatical errors. Human surveys reveal that to successfully preserve semantics, we need to significantly increase the minimum cosine similarities between the embeddings of swapped words and between the sentence encodings of the original and perturbed sentences. With constraints adjusted to better preserve semantics and grammaticality, the attack success rate drops by over 70 percentage points.
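The two constraints can be expressed as a simple filter on candidate swaps. The `passes_constraints` helper and the threshold values are hypothetical; the study's point is that such thresholds must be set much higher than prior attacks used in order to preserve semantics.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def passes_constraints(word_emb, swap_emb, sent_enc, pert_enc,
                       word_thresh=0.9, sent_thresh=0.9):
    """Accept a synonym swap only if both the word-embedding similarity
    and the sentence-encoding similarity clear their thresholds.
    Threshold values are illustrative, not those from the study."""
    return (cosine(word_emb, swap_emb) >= word_thresh
            and cosine(sent_enc, pert_enc) >= sent_thresh)

ok = passes_constraints(np.array([1.0, 0.0]), np.array([1.0, 0.1]),
                        np.array([0.5, 0.5]), np.array([0.5, 0.4]))
```

Raising either threshold shrinks the set of admissible perturbations, which is why stricter constraints sharply reduce attack success rates.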
Graph Neural Networks (GNNs) have shown satisfying performance on various graph learning tasks. To achieve better fitting capability, most GNNs carry a large number of parameters, which makes them computationally expensive. It is therefore difficult to deploy them onto edge devices with scarce computational resources, e.g., mobile phones and wearable smart devices. Knowledge Distillation (KD) is a common solution to compress GNNs, in which a light-weight model (i.e., the student model) is encouraged to mimic the behavior of a computationally expensive GNN (i.e., the teacher GNN model). Nevertheless, most existing GNN-based KD methods lack fairness consideration. As a consequence, the student model usually inherits and even exaggerates the bias of the teacher GNN. To handle such a problem, we take initial steps towards fair knowledge distillation for GNNs. Specifically, we first formulate a novel problem of fair knowledge distillation for GNN-based teacher-student frameworks. Then we propose a principled framework named RELIANT to mitigate the bias exhibited by the student model. Notably, the design of RELIANT is decoupled from any specific teacher and student model structures, and thus can be easily adapted to various GNN-based KD frameworks. We perform extensive experiments on multiple real-world datasets, which corroborate that RELIANT achieves less biased GNN knowledge distillation while maintaining high prediction utility.
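The distillation backbone that such frameworks share can be sketched as a temperature-softened KL term. This is a generic Hinton-style KD loss on logits; RELIANT's fairness/bias-mitigation term and any GNN specifics are not shown.

```python
import numpy as np

def softmax(z, T=1.0):
    """Numerically stable softmax with temperature T."""
    e = np.exp(z / T - np.max(z / T, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=2.0):
    """Soft-label distillation: KL divergence between temperature-softened
    teacher and student distributions, scaled by T^2 as is conventional."""
    p = softmax(teacher_logits, T)   # teacher's soft targets
    q = softmax(student_logits, T)   # student's predictions
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean() * T * T)

loss = kd_loss(np.array([[1.0, 0.0]]), np.array([[1.0, 0.0]]))  # → 0.0
```

A fairness-aware variant would add a penalty on the student's group-dependent behavior to this objective, which is the part RELIANT is designed to supply.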